Do Elon Musk’s Tweets Have an Influence on Tesla’s Stock Price?


Not long after sending out this tweet on 7 August 2018, Elon Musk was facing a lawsuit by the U.S. Securities and Exchange Commission (SEC) for making “false and misleading statements” and for market manipulation. With the tweet, Musk implied that he had secured funding to take Tesla private at the stated price, which was substantially above the actual trading price, although no real arrangements had been made and no offer had been agreed. According to the SEC, his tweet “set off a trading frenzy” and pushed Tesla’s stock price up by more than 6 percent, forcing the NASDAQ exchange to halt trading in Tesla for 90 minutes until the company gave an official response. The company’s stock price closed at $379.57 on the day of the tweet. Less than two months later, Musk agreed to a settlement that required him and Tesla to pay a fine of $20 million each and required him to step down as chairman of Tesla’s board. 1


Elon Musk by Dr. Seuss (Poem)

The SEC said:
“Musk, your tweets are a blight.
They really could cost you your job,
if you don’t stop
all this tweeting at night.”

…Then Musk cried:
“Why? The tweets I wrote are not mean,
I don’t use all-caps and I’m sure that my tweets are clean.”

“But your tweets can move markets
and that’s why we’re sore.
You may be a genius and a billionaire,
but that doesn’t give you the right to be a bore!”

— OpenAI, co-founded by Musk, AI-generated poem


What’s our Goal?

The idea of this R notebook is to introduce everyone interested in data science and machine learning to effective communication of data analysis and statistical findings by leveraging suitable visualisations. On the side, we also take a look at a way to analyse the evolution of Tesla’s stock price and the influence Elon Musk’s tweets had on Tesla’s stock price. For the purpose of visualising analyses and findings, the ggplot2 and plotly packages (as well as some additional packages) are used since they enable producing high-quality, publication-ready visualisations for static as well as dynamic and interactive applications. Both packages are built around the framework of the so-called Grammar of Graphics, a scientific syntax for effective data visualisations, which describes how specific elements or components of a plot should be separated and classified for a structured approach to visualisations. For more information, see Hadley Wickham (2010) - A Layered Grammar of Graphics and Wilkinson (2011) - The Grammar of Graphics.

I can also highly recommend the following resources:

#

# TODO: Add time of stock split to time series, search for short seller tweets in data, add a log scale plot
# to Tesla's stock price chart, add most recent stock return on distribution, think about colour choice (restrict it)

1 Settings

# Turn off warning messages

options(warn = -1)

# Custom function that installs a package if necessary and then loads it

install_and_load_package <- function(package) {

    # Check whether package is already installed and if not, install it

    if (!requireNamespace(package, quietly = TRUE)) {

        install.packages(package, dependencies = TRUE)

    }

    # Load specified package (stops with an error if it is still unavailable)

    library(package, character.only = TRUE)

}

# Specify packages needed for analysis in character vector

packages <- c("conflicted",
              "foreach",
              "doMC",
              "gapminder",
              "httr",
              "rtweet",
              "quantmod",
              "pins",
              "tidyverse",
              "lubridate",
              "tsbox",
              "tidytext",
              "qdap",
              "lmtest",
              "sandwich",
              "caret",
              "DT",
              "ggrepel",
              "plotly",
              "wordcloud2",
              "fmsb",
              "viridis",
              "viridisLite",
              "RColorBrewer")

# Install and load needed packages

lapply(packages, install_and_load_package)

# Conflicted: hierarchy in case of conflict

conflict_prefer("filter", "dplyr")
conflict_prefer("select", "dplyr")
conflict_prefer("first", "dplyr")
conflict_prefer("last", "dplyr")
conflict_prefer("lag", "dplyr")
conflict_prefer("flatten", "purrr")
conflict_prefer("layout", "plotly")

# Parallel computing settings: Using maximum number of available cores

n_CPU_cores <- detectCores()

registerDoMC(cores = n_CPU_cores)

# Color settings

palette(viridis(n = 10))

col_palette_red    <- brewer.pal(n = 9, name = "OrRd")

col_palette_yellow <- brewer.pal(n = 9, name = "YlOrRd")

col_palette_green  <- brewer.pal(n = 9, name = "YlGn")

col_palette_blue   <- brewer.pal(n = 9, name = "PuBu")

2 Data Input

# Some options for quantmod package

options("getSymbols.warning4.0" = F)

To start with, we get Tesla stock data (ticker = “TSLA”) from Yahoo Finance by using the quantmod package. All that is required to download the data is the ticker of the corresponding financial instrument.

getSymbols(Symbols = "TSLA",
           src     = "yahoo",
           verbose = F)
## 'getSymbols' currently uses auto.assign=TRUE by default, but will
## use auto.assign=FALSE in 0.5-0. You will still be able to use
## 'loadSymbols' to automatically load data. getOption("getSymbols.env")
## and getOption("getSymbols.auto.assign") will still be checked for
## alternate defaults.
## 
## This message is shown once per session and may be disabled by setting 
## options("getSymbols.warning4.0"=FALSE). See ?getSymbols for details.
## [1] "TSLA"

Second, we also get S&P 500 index (SPY ETF) data (ticker = “SPY”) from Yahoo Finance.

getSymbols(Symbols = "SPY",
           src     = "yahoo",
           verbose = F)
## [1] "SPY"

And finally, we download NASDAQ index data (ticker = “^IXIC”) from the same source.

getSymbols(Symbols = "^IXIC",
           src     = "yahoo",
           verbose = F)
## [1] "^IXIC"

Next, we do some data wrangling to transform the Tesla stock data into a tibble with the dplyr and tsbox packages and rename its columns. Tibbles are enhanced data.frames around which the tidyverse packages (and a great many other packages) are built. They provide a standardised way of storing data coming in diverse formats. I also use the pipe operator %>% to make the workflow and the required steps easy to grasp and to adjust later on (see picture below for a short explanation).

df_Tesla_stock_data <- TSLA %>%
    ts_tbl() %>%
    ts_wide() %>%
    rename(Date     = time,
           Open     = TSLA.Open,
           High     = TSLA.High,
           Low      = TSLA.Low,
           Close    = TSLA.Close,
           Volume   = TSLA.Volume,
           Adjusted = TSLA.Adjusted)
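To illustrate what the pipe operator does, here is a minimal sketch with made-up numbers (the `prices` vector is purely illustrative and not part of the Tesla data). The pipe passes the result on its left-hand side as the first argument to the function on its right-hand side, so a nested call can be read from top to bottom instead of from the inside out.

```r
library(magrittr)  # provides %>%; the tidyverse loads it as well

# Illustrative prices (assumption: made-up values)
prices <- c(100, 110, 99)

# Nested call: has to be read from the inside out
nested <- round(mean(log(prices)), digits = 3)

# Piped call: the same steps, read from top to bottom
piped <- prices %>%
    log() %>%
    mean() %>%
    round(digits = 3)

# Both versions compute exactly the same value
identical(nested, piped)
```

The piped version is usually easier to extend, since an additional step is just one more line in the chain.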

The Tesla stock data now looks like this, with daily observations for each trading day organised in the rows and seven different variables, also called features in the machine learning (ML) context, in the columns. For each of the 2’590 daily observations, we have the corresponding date in the Date column, the Opening stock price at trading start on the exchange, the daily Highest and Lowest price, the Close at end of trading, the trading Volume, and finally an Adjusted price, accounting for stock splits, dividends, and similar corporate actions.

datatable(df_Tesla_stock_data)

We do the same for the S&P 500 (SPY ETF) index data as well as the NASDAQ index.

df_SPY_data <- SPY %>%
    ts_tbl() %>%
    ts_wide() %>%
    rename(Date     = time,
           Open     = SPY.Open,
           High     = SPY.High,
           Low      = SPY.Low,
           Close    = SPY.Close,
           Volume   = SPY.Volume,
           Adjusted = SPY.Adjusted)

df_NASDAQ_data <- IXIC %>%
    ts_tbl() %>%
    ts_wide() %>%
    rename(Date           = time,
           OpenNASDAQ     = IXIC.Open,
           HighNASDAQ     = IXIC.High,
           LowNASDAQ      = IXIC.Low,
           CloseNASDAQ    = IXIC.Close,
           VolumeNASDAQ   = IXIC.Volume,
           AdjustedNASDAQ = IXIC.Adjusted)

The S&P 500 (SPY ETF) and NASDAQ index series have more observations than the Tesla series, i.e. both have data points on 3’468 days. Otherwise, they are in the same format. Here is what the S&P 500 time series looks like:

datatable(df_SPY_data)

Finally, we add all three stock price and index time series together to have them available in a single tibble.

df_Tesla_SPY_NASDAQ <- df_SPY_data %>%
    full_join(df_Tesla_stock_data,
              by     = "Date",
              suffix = c("SPY", "TSLA")) %>%
    full_join(df_NASDAQ_data,
              by     = "Date")
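How the suffix argument resolves identical column names can be seen in a small sketch. The two toy tibbles below are made up for illustration (`df_a`, `df_b` and their values are assumptions, not part of the notebook's data).

```r
library(dplyr)

# Toy data (assumption: made-up values) with an identically named price column
df_a <- tibble(Date     = as.Date(c("2020-01-02", "2020-01-03")),
               Adjusted = c(10, 11))

df_b <- tibble(Date     = as.Date(c("2020-01-03", "2020-01-06")),
               Adjusted = c(20, 21))

# full_join() keeps all dates from both tables; the suffixes are appended
# to column names that occur in both inputs
df_joined <- full_join(df_a, df_b,
                       by     = "Date",
                       suffix = c("A", "B"))

df_joined
```

Dates that exist in only one of the tables get NA in the columns of the other. Since unmatched rows of the second table are appended at the end, it can be worth sorting by Date (e.g. with arrange()) before computing lagged values on a joined table.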

2.1 Exercise 1: Get Apple stock data (hint: ticker = “AAPL”) and turn it into a tibble.

#

We now compute the (continuous) stock returns for all three financial instruments.

df_Tesla_SPY_NASDAQ <- df_Tesla_SPY_NASDAQ %>%
    mutate(ReturnsSPY    = log(AdjustedSPY) - lag(log(AdjustedSPY)),
           ReturnsTSLA   = log(AdjustedTSLA) - lag(log(AdjustedTSLA)),
           ReturnsNASDAQ = log(AdjustedNASDAQ) - lag(log(AdjustedNASDAQ)))
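As a small worked example of continuous (log) returns, consider three made-up prices (the values are illustrative assumptions). The sketch uses base R only and shows two properties: log returns add up over time, and they relate to simple returns via log(1 + simple return).

```r
# Illustrative prices (assumption: made-up values)
prices <- c(100, 110, 99)

# Continuous returns: difference of consecutive log prices
log_returns <- diff(log(prices))

# Simple returns for comparison
simple_returns <- prices[-1] / prices[-length(prices)] - 1

# Log returns are additive: their sum equals the log return over the whole period
all.equal(sum(log_returns), log(prices[3] / prices[1]))

# Link between the two definitions
all.equal(log_returns, log(1 + simple_returns))
```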

In addition, we scrape tweet data from Elon Musk’s and Tesla’s official Twitter accounts with the rtweet package. Unfortunately, only the most recent 3’212 tweets per user are available, because Twitter limits access to historical data in order to offer it commercially instead. Tweet scraping requires a Twitter account and a developer registration for the free Twitter API. This is fairly easy to set up, however, and should only take a couple of minutes (especially if you already have a Twitter account).

df_tweets_elon_musk <- get_timeline("elonmusk", n = 5000)

df_tweets_tesla     <- get_timeline("Tesla", n = 5000)

The tweets data set is rather big, with 90 columns. Thus, only a subset of the columns is shown here to give an idea of what the data set for Elon Musk’s tweets looks like:

df_tweets_elon_musk %>%
    select(created_at, screen_name, text, source,
           is_retweet, favorite_count, retweet_count, hashtags) %>%
    datatable(filter  = "top",
              options = list(pageLength = 5,
                             autoWidth  = F))

…and Tesla’s official Twitter account:

df_tweets_tesla %>%
    select(created_at, screen_name, text, source,
           is_retweet, favorite_count, retweet_count, hashtags) %>%
    datatable(filter = "top",
              options = list(pageLength = 5,
                             autoWidth  = F))

3 Our First Plot - Time Series of Tesla’s Stock Price

Now we’re ready to take the Tesla stock price data and create a basic ggplot2 time series chart. We need the above-mentioned Grammar of Graphics to set up each specific component in the plot. First, we need to map the data to so-called aesthetics in the plot. Aesthetics are defined within the aes() function in ggplot2 and include plot specifications such as what goes on the x-axis and y-axis, what is shown in which colour, how the size of an object in a plot is determined, and many more. For our basic time series plot, we simply map the Date column from the stock data to the x-axis and the Adjusted stock price to the y-axis. The only additional component needed to get a finished plot is a so-called geom (short for geometric object). Geoms determine the kind of plot we want to display and are added with the set of geom_... functions. Here, we’d like to create a simple line plot with geom_line(). We add it as a new component to the plot by using the + operator. After saving the plot to a new R object, we have our first plot.

p_basic_time_series_Tesla <- ggplot(data = df_Tesla_stock_data,
                                    aes(x = Date, y = Adjusted)) +  # Close
    geom_line()

p_basic_time_series_Tesla

3.1 Exercise 2: Create a time series plot for Apple’s stock price. You can also try to adjust the axis scales, in case you have any idea how to do it.

#

So far, so good. This is what we get by using ggplot2’s default settings. However, the plot doesn’t look particularly great, does it? The grey background is rather irritating, the date on the x-axis is only displayed every five years, it’s unclear in what units the y-axis is measured, and in general, there’s no title or anything else to indicate what exactly is shown here. The only information we have is the evolution of the series over a time period of 10 years and its corresponding values on the y-axis. We need to adjust some basic components of the plot.

For a visual overview and corresponding explanations of the different components in ggplot2’s Grammar of Graphics, see this Towards Data Science article:

Since we already defined our data and aesthetics components, we start by adjusting the scales of the x- and y-axes in a new component, the scales component. This ensures we get proper units and labels for the x- and y-axis. We reuse the plot object from above and add the scale_x_... and scale_y_... functions with proper arguments.

p_basic_time_series_Tesla_w_scales <- p_basic_time_series_Tesla +
    scale_x_date(date_breaks = "1 year",
                 date_labels = "%Y") +
    scale_y_continuous(labels = scales::dollar,
                       breaks = seq(from = 0, to = max(df_Tesla_stock_data$Adjusted, na.rm = T), by = 100))

p_basic_time_series_Tesla_w_scales

The theme of a plot is yet another component in the Grammar of Graphics. Setting a beautiful theme will help us to get rid of the irritating grey background. Let’s try the theme_classic() function.

p_basic_time_series_Tesla_w_scales_and_theme <- p_basic_time_series_Tesla_w_scales +
    theme_classic()

p_basic_time_series_Tesla_w_scales_and_theme

3.2 Exercise 3: Create the plot with the same theme and appropriately adjusted scales for Apple. Try adding a proper title to the plot.

#

theme_classic() is quite a beautiful and minimalist theme. For the purpose of interpreting a time series plot, however, a theme including a grid may be more appropriate. Thus, in the following plots, we use theme_light() instead. We make sure the grid lines stay in the background of the plot by slightly fading them out, since they are only meant to support the viewer in reading the scales on the axes. Next, we would also like to add a proper title. Main titles and subtitles as well as axis labels are set with the labs() function. In addition, we accentuate the x- and y-axis by drawing them thicker than the background grid lines. Let’s also adjust the label of the y-axis to make it clearer what it represents. Finally, we add a caption with a copyright notice for the plot. Now we have our first complete time series plot.

p_basic_time_series_Tesla_w_scales +
    theme_light() +
    theme(plot.title       = element_text(face = "bold"),
          axis.line        = element_line(size = 0.75),  # thicker axes
          panel.grid.major = element_line(size = 0.05),
          panel.grid.minor = element_line(size = 0.05)) +
    labs(title    = "Rising Higher and Higher...",
         subtitle = "Tesla Stock Price",
         y        = "Close (Adjusted)",
         caption  = "© Data Science & Technology Club HSG")

For the following plots, let’s set a global default ggplot2 theme, instead of adding it manually to each plot.

theme_set(theme_light())

To improve further on our plot, we can add a so-called benchmark to it. A benchmark is, e.g., another time series to compare the Tesla stock price to. We use the previously gathered S&P 500 prices to do exactly that. In order to be able to compare the prices of the two series and to get them into the same y-axis limits, some data wrangling and rebasing is required. While the S&P 500 is a sensible measure of the broad overall U.S. stock market to compare Tesla to, one could argue that Tesla is more of a technology company and thus should rather be compared to the NASDAQ index instead. Hence, we also include the NASDAQ index as a benchmark and add text labels and end points for the different time series.

df_Tesla_SPY_NASDAQ <- df_Tesla_SPY_NASDAQ %>%
    mutate(AdjustedTSLARebased    = AdjustedTSLA / first(df_Tesla_stock_data$Adjusted),
           AdjustedSPYRebased     = AdjustedSPY / first(df_SPY_data$Adjusted),
           AdjustedNASDAQRebased  = AdjustedNASDAQ / first(df_NASDAQ_data$AdjustedNASDAQ))
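# A note on rebasing, written as R comments with a small sketch because it sits
# between two code chunks: rebased series are easiest to compare when all of
# them start at 1 (i.e. 100%) on the same date. The toy example below uses
# made-up values (df_toy, A and B are hypothetical names) and rebases both
# series to the first date on which both have data. It is a sketch under these
# assumptions, not the approach used in the chunk above.

```r
library(dplyr)

# Toy data (assumption: made-up values); series B starts one day later than A
df_toy <- tibble(Date = as.Date("2020-01-01") + 0:3,
                 A    = c(50, 55, 60, 66),
                 B    = c(NA, 200, 210, 220))

# First date on which both series are observed
start_date <- df_toy %>%
    filter(!is.na(A), !is.na(B)) %>%
    summarise(start = min(Date)) %>%
    pull(start)

# Rebase both series to that common start date
df_rebased <- df_toy %>%
    filter(Date >= start_date) %>%
    mutate(ARebased = A / dplyr::first(A),
           BRebased = B / dplyr::first(B))

df_rebased
```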

p_time_series_Tesla_vs_SPY <- df_Tesla_SPY_NASDAQ %>%
    ggplot(aes(x = Date)) +
    geom_line(aes(y = AdjustedTSLARebased), col = col_palette_blue[6]) +
    geom_point(aes(x = last(Date),
                   y = last(AdjustedTSLARebased)),
               col   = col_palette_blue[6],
               shape = 1,
               size  = 1.5) +
    geom_text(label = "TSLA",
              aes(x = last(Date),
                  y = last(AdjustedTSLARebased)),
              color = col_palette_blue[6],
              hjust = 1.4,
              vjust = -1) +
    geom_line(aes(y = AdjustedSPYRebased), col = col_palette_green[7]) +
    geom_point(aes(x = last(Date),
                   y = last(AdjustedSPYRebased)),
               col   = col_palette_green[7],
               shape = 1,
               size  = 1.5) +
    geom_text(label = "S&P 500",
              aes(x = last(Date),
                  y = last(AdjustedSPYRebased)),
              color = col_palette_green[7],
              hjust = 1.4,
              vjust = -1) +
    geom_line(aes(y = AdjustedNASDAQRebased), col = col_palette_green[9]) +
    geom_point(aes(x = last(Date),
                   y = last(AdjustedNASDAQRebased)),
               col   = col_palette_green[9],
               shape = 1,
               size  = 1.5) +
    geom_text(label = "NASDAQ",
              aes(x = last(Date),
                  y = last(AdjustedNASDAQRebased)),
              color = col_palette_green[9],
              hjust = 1.4,
              vjust = -2) +
    scale_x_date(date_breaks = "1 year",
                 date_labels = "%Y") +
    scale_y_continuous(labels = scales::percent,
                       breaks = seq(from = 0, to = 110, by = 10)) +
    labs(title    = "Is Tesla's Stock Price an Inflated Bubble - Close to Bursting?",
         subtitle = "Tesla's Stock Price vs. NASDAQ and S&P 500 Benchmarks",
         y        = "Price Rebased (%)",
         caption  = "© Data Science & Technology Club HSG") +
    theme(legend.text      = element_text(),
          plot.title       = element_text(face = "bold"),
          axis.line        = element_line(size = 0.75),
          panel.grid.major = element_line(size = 0.05),
          panel.grid.minor = element_line(size = 0.05))

p_time_series_Tesla_vs_SPY

It is impressive how much Tesla’s stock price outperforms the (already well performing) S&P 500. In particular, beginning in mid October 2019, the volatility of the stock increases immensely: a sharp rise is followed by a sharp decline and then another sharp rise. It remains questionable whether Tesla’s recent stock price appreciation is sustainable and warranted in the long run. Let’s highlight the period during which Tesla’s stock price increase was most notable in the chart. We can do this with the annotate() function. Highlighting areas or specific parts of a chart is a useful element in storytelling with data (engaging titles, proper labels, and colours are others).

p_time_series_Tesla_vs_SPY +
    annotate(geom  = "rect",
             xmin  = as.Date("2019-10-15"),
             xmax  = last(df_Tesla_SPY_NASDAQ$Date) + 35,
             ymin  = -Inf,
             ymax  = Inf,
             col   = "grey",
             alpha = 0.05) +
    annotate(geom  = "text",
             label = "High Volatility Period",
             x     = as.Date("2020-04-15"),
             y     = -3,
             col   = col_palette_red[8],
             size  = 3)

3.3 Exercise 4: Add a title, subtitle, and some text or line annotations to your Apple chart.

#

4 Our Second Plot - Scatter Plot

Next, we turn to one of the most basic, but also most useful plots - the scatter plot. First, however, we compute mean returns for all three financial instruments.

df_Tesla_SPY_NASDAQ_avg <- df_Tesla_SPY_NASDAQ %>%
    summarise(SPY_mean    = mean(ReturnsSPY, na.rm = T),
              TSLA_mean   = mean(ReturnsTSLA, na.rm = T),
              NASDAQ_mean = mean(ReturnsNASDAQ, na.rm = T))

We use geom_jitter() instead of geom_point() since it slightly and randomly displaces individual observations in order to avoid overplotting, making the individual points more visible. Returns of the NASDAQ go on the x-axis and returns of Tesla on the y-axis. We also highlight the most recent return in the data, to see where it stands in comparison to historical returns. The if_else() function is pretty handy for this purpose.

p_scatter_Tesla_NASDAQ <- df_Tesla_SPY_NASDAQ %>%
    ggplot(aes(x = ReturnsNASDAQ, y = ReturnsTSLA)) +
    geom_jitter(aes(col = if_else(Date == max(Date, na.rm = T), "Today", "Historical")),
                alpha = 0.5) +  # geom_point()
    scale_x_continuous(labels = scales::percent) +
    scale_y_continuous(labels = scales::percent) +
    scale_color_manual(name   = "Date",
                       values = c(col_palette_blue[6], col_palette_red[7])) +
    labs(title    = "Is Tesla Related to the Broad Technology Market Index?",
         subtitle = "NASDAQ vs. TSLA Returns",
         x        = "NASDAQ Returns (Continuous)",
         y        = "TSLA Returns (Continuous)",
         caption  = "© Data Science & Technology Club HSG") +
    theme(legend.text = element_text(),
          plot.title  = element_text(face = "bold"),
          axis.line   = element_line(size = 0.75),
          panel.grid.major = element_line(size = 0.05),
          panel.grid.minor = element_line(size = 0.05))

p_scatter_Tesla_NASDAQ

Scatter plots are great for analysing the relationship between two (continuous) variables and are probably the most used charts in research and ML contexts. To check whether a linear relationship between the returns of the NASDAQ and Tesla exists, we can add a regression line with geom_smooth(). The method argument is set to lm for linear model. geom_smooth() automatically adds confidence bands, which is pretty handy.

p_scatter_Tesla_NASDAQ +
    geom_smooth(method = "lm",
                col    = col_palette_red[7])
## `geom_smooth()` using formula 'y ~ x'

Did you notice how the regression line immediately became the center of our attention? This is due to the colour choice it’s mapped to in relation to other elements in the plot. Ideally, we use colours in a restrictive way to highlight specific and particularly important aspects in our visualisations.

By looking at the scatter plot and the dispersion of points, however, it is doubtful whether the relationship is truly linear. Thus, we can try to set another model, such as loess (local polynomial regression fitting), in geom_smooth(). loess is a non-linear model (curved regression line).

p_scatter_Tesla_NASDAQ +
    geom_smooth(method = "loess",
                col    = col_palette_red[7])
## `geom_smooth()` using formula 'y ~ x'

Getting back to the relationship between NASDAQ and Tesla returns: when looking at these plots alone, it remains unclear what the true relationship between the returns is. All we can say is that Tesla on average seems to perform better when the NASDAQ also performs well. However, the more extreme the returns are, the more uncertainty there is about the relationship, as indicated by the wider confidence intervals. This is due to the comparatively few observations we have for extreme returns.
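To put numbers on a linear fit like the one drawn by geom_smooth(method = "lm"), the same kind of model can be estimated directly with lm(). The sketch below uses simulated returns, not the actual NASDAQ and Tesla data: the variable names, the "true" slope of 1.3 and the noise levels are all assumptions chosen for illustration.

```r
# Simulated daily returns (assumption: made-up data-generating process)
set.seed(42)

returns_index <- rnorm(n = 500, mean = 0, sd = 0.01)
returns_stock <- 0.001 + 1.3 * returns_index + rnorm(n = 500, mean = 0, sd = 0.03)

# Fit the linear model that corresponds to the formula y ~ x
fit <- lm(returns_stock ~ returns_index)

# Estimated intercept and slope
coef(fit)

# 95% confidence intervals for both coefficients
confint(fit, level = 0.95)
```

The slope measures how strongly the stock moves with the index on average; in finance it is commonly referred to as the stock's beta with respect to that index.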

#

4.1 Exercise 5: Create a scatter plot (e.g. with Apple stock returns and trading volumes) and fit a linear or non-linear model to gauge the relationship among the variables.

#

5 Our Third Plot - Bar Chart of Tesla’s Stock Volume

To quickly demonstrate how to build a bar chart, we use the Volume variable in Tesla’s stock data. This is done with geom_col(). In addition, we use colour highlighting to draw the reader’s attention to certain aspects of the plot.

p_bar_Tesla_stock_volume <- df_Tesla_stock_data %>%
    ggplot(aes(x = Date, y = Volume)) +
    geom_col(aes(fill = if_else(between(Date,
                                        as.Date("2013-05-09"),
                                        as.Date("2014-01-01")),
                                "Steep Increase",
                                "Normal"))) +
    labs(title    = "Trading Volume in Tesla has Increased Greatly, Starting in Mid 2013",
         subtitle = "Tesla Trading Volume",
         caption  = "© Data Science & Technology Club HSG") +
    scale_x_date(date_breaks = "1 year",
                 date_labels = "%Y") +
    scale_y_continuous(labels = scales::comma,  # volume is a number of shares, not a dollar amount
                       breaks = seq(from = 0, to = max(df_Tesla_stock_data$Volume, na.rm = T), by = 50e6)) +
    scale_fill_manual(name   = "Volume Level",
                      values = c("Normal" = "grey36","Steep Increase" = col_palette_red[7])) +
    theme(legend.text = element_text(),
          plot.title  = element_text(face = "bold"),
          axis.line   = element_line(size = 0.75),
          panel.grid.major = element_line(size = 0.05),
          panel.grid.minor = element_line(size = 0.05))

p_bar_Tesla_stock_volume

We can play with the width argument in geom_col() to adjust the width of the bars plotted.

p_bar_Tesla_stock_volume <- df_Tesla_stock_data %>%
    ggplot(aes(x = Date, y = Volume)) +
    geom_col(aes(fill = if_else(between(Date,
                                        as.Date("2013-05-09"),
                                        as.Date("2014-01-01")),
                                "Steep Increase",
                                "Normal")),
             width = 0.3) +
    labs(title    = "Trading Volume in Tesla has Increased Greatly, Starting in Mid 2013",
         subtitle = "Tesla Trading Volume",
         caption  = "© Data Science & Technology Club HSG") +
    scale_x_date(date_breaks = "1 year",
                 date_labels = "%Y") +
    scale_y_continuous(labels = scales::comma,  # volume is a number of shares, not a dollar amount
                       breaks = seq(from = 0, to = max(df_Tesla_stock_data$Volume, na.rm = T), by = 50e6)) +
    scale_fill_manual(name   = "Volume Level",
                      values = c("Normal" = "grey36","Steep Increase" = col_palette_red[7])) +
    theme(legend.text = element_text(),
          plot.title  = element_text(face = "bold"),
          axis.line   = element_line(size = 0.75),
          panel.grid.major = element_line(size = 0.05),
          panel.grid.minor = element_line(size = 0.05))

p_bar_Tesla_stock_volume

6 Histogram - Tesla Stock Returns

First, we create a histogram to visualise the distribution of Tesla’s stock returns over time.

p_hist_Tesla <- df_Tesla_SPY_NASDAQ %>%
    ggplot(aes(x = ReturnsTSLA)) +
    geom_histogram(bins  = 200,
                   col   = "white",
                   fill  = col_palette_blue[6],
                   alpha = 0.85) +
    labs(title    = "Histogram",
         subtitle = "Tesla Stock Returns",
         x        = "Continuous Returns",
         y        = "Count") +
    scale_x_continuous(label = scales::percent) +
    theme(legend.text = element_text(),
          plot.title  = element_text(face = "bold"),
          axis.line   = element_line(size = 0.75),
          panel.grid.major = element_line(size = 0.05),
          panel.grid.minor = element_line(size = 0.05))

p_hist_Tesla

Then, we add a density to the distribution.

p_hist_Tesla <- p_hist_Tesla +
    geom_density(kernel = "gaussian",
                 col    = col_palette_red[7])

p_hist_Tesla
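One thing to keep in mind when overlaying a density on a histogram: by default, geom_histogram() plots counts on the y-axis while geom_density() plots a density, so the two layers are measured in different units. A common way to put both on the same scale is to draw the histogram on the density scale with after_stat(density) (available in ggplot2 3.3.0 or later). The sketch below uses simulated returns; df_toy and its values are assumptions for illustration.

```r
library(ggplot2)

# Simulated returns (assumption: made-up data)
set.seed(1)

df_toy <- data.frame(Returns = rnorm(n = 1000, mean = 0, sd = 0.03))

p_toy <- ggplot(df_toy, aes(x = Returns)) +
    # Histogram rescaled so that the bar areas sum to 1
    geom_histogram(aes(y = after_stat(density)),
                   bins = 50,
                   col  = "white",
                   fill = "grey60") +
    # The kernel density estimate is now in the same units as the bars
    geom_density(col = "firebrick")

p_toy
```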

Next, we add the mean and median return over time.

Tesla_returns_mean   <- mean(df_Tesla_SPY_NASDAQ$ReturnsTSLA, na.rm = T)

Tesla_returns_median <- median(df_Tesla_SPY_NASDAQ$ReturnsTSLA, na.rm = T)

p_hist_Tesla +
    geom_vline(xintercept = Tesla_returns_mean,
               col        = col_palette_yellow[3],
               size       = 1.5) +
    geom_vline(xintercept = Tesla_returns_median,
               col        = col_palette_green[7],
               size       = 1.5) +
    scale_fill_manual(name = "Metric",
                      values = c("Mean" = col_palette_yellow[3], "Median" = col_palette_green[7]))

# Determine y-axis density position of median, mean, and confidence intervals

p_hist_Tesla <- df_Tesla_SPY_NASDAQ %>%
    ggplot(aes(x = ReturnsTSLA)) +
    stat_density(aes(y = ..scaled..),
                 geom   = "line",
                 size   = 0.5,
                 col    = col_palette_blue[6],
                 adjust = 1) +
    labs(title = "Density - Tesla Stock Returns",
         x     = "Continuous Returns",
         y     = "Scaled Density") +
    scale_x_continuous(label = scales::percent) +
    theme(legend.text = element_text(),
          plot.title  = element_text(face = "bold"),
          axis.line   = element_line(size = 0.75),
          panel.grid.major = element_line(size = 0.05),
          panel.grid.minor = element_line(size = 0.05))

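# Standard error of the mean: divide by the number of non-missing returns (length() would also count NAs)

mean_se <- sd(df_Tesla_SPY_NASDAQ$ReturnsTSLA, na.rm = T) / sqrt(sum(!is.na(df_Tesla_SPY_NASDAQ$ReturnsTSLA)))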

mean_conf_inter_l <- Tesla_returns_mean - 1.96 * mean_se

mean_conf_inter_u <- Tesla_returns_mean + 1.96 * mean_se
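# The interval above is the usual normal-approximation confidence interval for
# a mean: mean +/- 1.96 * standard error. The short base R sketch below, given
# as comments plus code because it sits between two code chunks, uses a made-up
# vector with missing values (an assumption for illustration). It shows why the
# number of observations should exclude NAs: counting them makes the standard
# error too small.

```r
# Illustrative returns with missing values (assumption: made-up values)
x <- c(0.01, -0.02, NA, 0.03, NA, 0.00)

# Number of non-missing observations
n_obs <- sum(!is.na(x))

# Standard error of the mean
se <- sd(x, na.rm = TRUE) / sqrt(n_obs)

# 95% confidence interval; qnorm(0.975) is approximately 1.96
ci <- mean(x, na.rm = TRUE) + c(-1, 1) * qnorm(0.975) * se

ci

# For comparison: dividing by length(x) also counts the NAs and understates the error
se_too_small <- sd(x, na.rm = TRUE) / sqrt(length(x))

se > se_too_small
```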

mean_pos_y <- ggplot_build(p_hist_Tesla)$data[[1]] %>%
    slice(which.min(abs(x - Tesla_returns_mean))) %>%
    pull(ndensity)

mean_conf_inter_l_pos_y <- ggplot_build(p_hist_Tesla)$data[[1]] %>%
    slice(which.min(abs(x - mean_conf_inter_l))) %>%
    pull(ndensity)

mean_conf_inter_u_pos_y <- ggplot_build(p_hist_Tesla)$data[[1]] %>%
    slice(which.min(abs(x - mean_conf_inter_u))) %>%
    pull(ndensity)

p_hist_Tesla +
    geom_segment(x = Tesla_returns_mean,
                 xend = Tesla_returns_mean,
                 y = 0,
                 yend = mean_pos_y,
                 linetype = "solid",
                 color = col_palette_blue[6],
                 size = 0.4) +
    geom_point(x = Tesla_returns_mean,
               y = mean_pos_y,
               col = col_palette_blue[6])

    # geom_area(x = mean_conf_inter_l,
    #              xend = mean_conf_inter_u,
    #              y = mean_conf_inter_l_pos_y,
    #              yend = mean_conf_inter_u_pos_y,
    #              linetype = "solid",
    #              color = "grey",
    #              size = 0.4)

7 Faceted Time Series Plot

If we want to display multiple series in a single plot, this is best done by using the ggplot2 facets component. It is applied as a separate component in our already existing time series plot. First, however, some data wrangling is required to transform the data from wide to long format.

df_Tesla_stock_data_long <- df_Tesla_stock_data %>%
    select(-Volume) %>%
    pivot_longer(cols      = -Date,
                 names_to  = "Variable",
                 values_to = "Values")
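# What the wide-to-long transformation does is easiest to see on a tiny
# example, given here as comments plus code because it sits between two code
# chunks. The toy tibble is made up for illustration (df_wide and its values
# are assumptions). Every column except Date is stacked into a pair of
# name/value columns.

```r
library(tidyr)
library(tibble)

# Toy data in wide format (assumption: made-up values)
df_wide <- tibble(Date  = as.Date("2020-01-01") + 0:1,
                  Open  = c(10, 11),
                  Close = c(10.5, 11.5))

# Long format: one row per date and variable
df_long <- pivot_longer(df_wide,
                        cols      = -Date,
                        names_to  = "Variable",
                        values_to = "Values")

df_long
```

# The two rows and two price columns of the wide table become four rows in the
# long table. This is the shape that facet_wrap() needs in order to split the
# plot by Variable.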

p_time_series_Tesla_faceted <- df_Tesla_stock_data_long %>%
    ggplot(aes(x = Date, y = Values, col = Variable)) +
    geom_line() +
    facet_wrap(. ~ Variable) +
    scale_x_date(date_breaks = "1 year",
                 date_labels = "%Y") +
    scale_y_continuous(labels = scales::dollar,
                       breaks = seq(from = 0, to = max(df_Tesla_stock_data_long$Values, na.rm = T), by = 100)) +
    scale_color_viridis(discrete = T) +
    labs(title   = "Faceted Stock Price Time Series - Tesla",
         y       = "Stock Price",
         caption = "© Data Science & Technology Club HSG") +
    theme(axis.text.x = element_text(angle = 60,
                                     vjust = 0.5),
          panel.grid.major = element_line(size = 0.1),
          panel.grid.minor = element_line(size = 0.05))

p_time_series_Tesla_faceted

8 Interactive Plots with Plotly

To add some more spice to the previously built plots, we can turn them into interactive web graphs. This is where the plotly package comes into play. It is built on the plotly.js (JavaScript) library and is extremely useful and versatile when it comes to interactive plots used in reports, dashboards or web pages.

p_time_series_Tesla_faceted %>%
    ggplotly()
# FIXME: Annotation doesn't work yet

p_time_series_Tesla_vs_SPY <- p_time_series_Tesla_vs_SPY +
    theme_classic()

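p_time_series_Tesla_vs_SPY <- p_time_series_Tesla_vs_SPY %>%
    ggplotly() %>%
    layout(annotations = list(x         = 1,
                              y         = 1,
                              xref      = "paper",  # position relative to the plot area, not the data
                              yref      = "paper",
                              showarrow = FALSE,
                              text      = "© Data Science & Technology Club HSG"))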

p_time_series_Tesla_vs_SPY

9 Elon Musk’s Tweets

We start by computing the number of tweets Elon Musk writes per day and then show summary statistics of these daily counts.

df_tweets_elon_musk_per_day <- df_tweets_elon_musk %>%
    mutate(Date = as.Date(created_at)) %>%
    group_by(Date) %>%
    summarise(TweetsN = n())
## `summarise()` ungrouping output (override with `.groups` argument)
df_tweets_elon_musk_per_day %>%
    summarise(Min            = min(TweetsN, na.rm = T),
              `1st Quartile` = quantile(TweetsN, probs = 0.25, na.rm = T),
              Median         = median(TweetsN, na.rm = T),
              Mean           = round(mean(TweetsN, na.rm = T), digits = 2),
              `3rd Quartile` = quantile(TweetsN, probs = 0.75, na.rm = T),
              Max            = max(TweetsN, na.rm = T)) %>%
    datatable(caption = htmltools::tags$caption(style = "caption-side: bottom; text-align: center;",
                                                "Table 1: ",
                                                htmltools::em("Summary statistics of daily tweets by Elon Musk.")))
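
One detail worth checking in the daily aggregation: `as.Date()` applied to a date-time uses UTC by default, so a tweet sent in the evening in California is counted on the following UTC day. The sketch below assumes `created_at` is a POSIXct timestamp stored in UTC, which the source does not state; the timestamp itself is made up.

```r
# Hypothetical tweet timestamp, assumed to be stored in UTC
created_at_toy <- as.POSIXct("2018-08-07 01:30:00", tz = "UTC")

# Default behaviour: the calendar date in UTC
date_utc <- as.Date(created_at_toy)

# Calendar date as seen in a chosen local time zone (an illustrative choice)
date_local <- as.Date(created_at_toy, tz = "America/Los_Angeles")

date_utc
date_local
```

Which convention is appropriate depends on the question; for comparisons with US trading days, a US time zone may be the more natural choice.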

Next, we create a bar plot to visualise the number of tweets per day.

p_bar_tweets_elon_musk <- df_tweets_elon_musk_per_day %>%
    ggplot(aes(x = Date, y = TweetsN, fill = TweetsN)) +
    geom_col() +
    scale_x_date(date_breaks = "1 month",
                 date_labels = "%Y %b") +
    scale_y_continuous(breaks = seq(0, 60, 10)) +
    labs(title = "How Many Tweets Does Elon Musk Write per Day?",
         x     = "Month",
         y     = "Number of Tweets") +
    scale_fill_binned(name = "Number of \nTweets by \nElon Musk", type = "viridis") +
    theme(legend.text      = element_text(),
          plot.title       = element_text(face = "bold"),
          axis.line        = element_line(size = 0.75),
          axis.text.x      = element_text(angle = 60,
                                          hjust = 1),
          panel.grid.major = element_line(size = 0.1),
          panel.grid.minor = element_line(size = 0.05))

p_bar_tweets_elon_musk <- p_bar_tweets_elon_musk %>%
    ggplotly()

p_bar_tweets_elon_musk

We can compare this to the evolution of Tesla’s stock price.

subplot(p_time_series_Tesla_vs_SPY,
        p_bar_tweets_elon_musk,
        nrows  = 2,
        shareX = T)

Let’s see whether the number of tweets by Elon Musk per day is associated in any way with the returns of Tesla’s stock. We naively try this with a scatter plot first.

df_Tesla_EM_tweets <- df_Tesla_SPY_NASDAQ %>%
    full_join(df_tweets_elon_musk_per_day,
              by = "Date") %>%
    select(Date, ReturnsTSLA, TweetsN)
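
Because tweets also appear on weekends and holidays while returns only exist for trading days, `full_join()` leaves `NA` values wherever one side has no match. A minimal sketch with made-up values (the data frames `df_toy_returns` and `df_toy_tweets` are hypothetical) shows the effect:

```r
library(dplyr)

# Hypothetical returns for two trading days (a Friday and a Monday)
df_toy_returns <- tibble(Date        = as.Date(c("2020-01-03", "2020-01-06")),
                         ReturnsTSLA = c(0.01, -0.02))

# Hypothetical tweet counts, including a Saturday without trading
df_toy_tweets <- tibble(Date    = as.Date(c("2020-01-03", "2020-01-04", "2020-01-06")),
                        TweetsN = c(4L, 9L, 2L))

# full_join() keeps all dates; the Saturday row gets NA for ReturnsTSLA
df_toy_full <- full_join(df_toy_returns, df_toy_tweets, by = "Date")

# inner_join() keeps only dates present in both tables
df_toy_inner <- inner_join(df_toy_returns, df_toy_tweets, by = "Date")

df_toy_full
```

ggplot2 drops rows with missing values when drawing points (with a warning), so the scatter plot effectively shows only days that have both a return and a tweet count.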

p_scatter_Tesla_EM_tweets <- df_Tesla_EM_tweets %>%
    ggplot(aes(x = ReturnsTSLA, y = TweetsN)) +
    geom_jitter(col   = col_palette_blue[6],
                alpha = 0.5) +
    geom_vline(xintercept = 0,
               size       = 1,
               alpha      = 0.1) +
    scale_x_continuous(labels = scales::percent) +
    scale_y_continuous(breaks = seq(0, max(df_tweets_elon_musk_per_day$TweetsN, na.rm = T), 10)) +
    labs(title    = "Number of Daily Tweets vs. TSLA Returns",
         subtitle = "Scatter Plot",
         x        = "TSLA Returns (Continuous)",
         y        = "Number of Tweets per Day",
         caption  = "© Data Science & Technology Club HSG") +
    theme(legend.text = element_text(),
          plot.title  = element_text(face = "bold"),
          axis.line   = element_line(size = 0.75))

p_scatter_Tesla_EM_tweets %>%
    ggplotly()

We can also add a smoothed regression line (LOESS, a local polynomial regression) to check for an association. However, as we already suspected from the simple scatter plot, no direct relationship is visible here.

p_scatter_Tesla_EM_tweets_reg <- p_scatter_Tesla_EM_tweets +
    geom_smooth(method = "loess",
                col    = col_palette_red[7])

p_scatter_Tesla_EM_tweets_reg
## `geom_smooth()` using formula 'y ~ x'
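
A rank correlation can serve as a numeric companion to the visual check. The following sketch uses made-up values rather than the actual data, and Spearman's coefficient is an illustrative choice here, not part of the original analysis.

```r
# Hypothetical daily returns and tweet counts; the NA mimics a non-trading day
returns_toy <- c(-0.03, 0.01, 0.02, -0.01, 0.04, 0.00, NA)
tweets_toy  <- c(5, 12, 3, 8, 10, 1, 7)

# Spearman's rank correlation, computed on the complete pairs only
rho_toy <- cor(returns_toy, tweets_toy,
               method = "spearman",
               use    = "complete.obs")

round(rho_toy, 3)
```

With the real data, the same call could be applied to `df_Tesla_EM_tweets$ReturnsTSLA` and `df_Tesla_EM_tweets$TweetsN`; a value near zero would be consistent with a flat smoothing line.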

Next, we produce a boxplot with the same underlying data as before. To do this, we need to sort the number of tweets into so-called “bins”. We choose 12 bins, which splits the number of tweets into intervals of width approximately 5.

p_boxplot_Tesla_EM_tweets <- df_Tesla_EM_tweets %>%
    mutate(TweetsN = cut(TweetsN, breaks = 12)) %>%
    filter_all(~ !is.na(.)) %>%
    ggplot(aes(x = TweetsN, y = ReturnsTSLA)) +
    geom_boxplot(col = col_palette_blue[6]) +
    scale_y_continuous(labels = scales::percent) +
    scale_color_viridis_d() +
    labs(title    = "Number of Daily Tweets vs. TSLA Returns",
         subtitle = "Boxplot",
         x        = "Number of Tweets per Day",
         y        = "TSLA Returns (Continuous)",
         caption  = "© Data Science & Technology Club HSG") +
    theme(legend.text = element_text(),
          plot.title  = element_text(face = "bold"),
          axis.line   = element_line(size = 0.75),
          axis.text.x = element_text(angle = 90,
                                     vjust = 1))

p_boxplot_Tesla_EM_tweets %>%
    ggplotly()
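
When `breaks` is a single number, `cut()` divides the observed range into that many intervals of equal width, so the bin edges depend on the data and are usually not round numbers. If fixed edges are preferred, they can be passed explicitly. A small sketch on made-up counts (the vector `tweet_counts_toy` is hypothetical):

```r
# Hypothetical daily tweet counts spanning 0 to 60
tweet_counts_toy <- c(0, 3, 7, 12, 25, 41, 60)

# Twelve equal-width bins derived from the range of the data
bins_auto <- cut(tweet_counts_toy, breaks = 12)

# Twelve bins with explicit edges at multiples of 5;
# include.lowest = TRUE keeps the value 0 in the first bin
bins_fixed <- cut(tweet_counts_toy,
                  breaks         = seq(from = 0, to = 60, by = 5),
                  include.lowest = TRUE)

table(bins_fixed)
```

Explicit edges make the axis labels easier to read, at the cost of having to know the range of the counts in advance.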

We conclude that there is no clear association between the number of tweets per day and stock returns, as there is no rising trend across the binned box plots (note that the axes are swapped relative to the scatter plot above). Our naive comparison of daily tweet counts with stock returns therefore shows no obvious association.

# Get Tesla tweets

df_tweets_elon_musk_Tesla <- df_tweets_elon_musk %>%
    filter(str_detect(text, pattern = "Tesla"))

So now, let’s dive deeper and take a look at Musk’s infamous “taking-Tesla-private” tweet from 7 August 2018.

p_time_series_Tesla_private_tweet <- df_Tesla_stock_data %>%
    filter(between(Date,
                   as.Date("2018-07-01"),
                   as.Date("2018-09-14"))) %>%
    mutate(Date = as_datetime(Date, tz = "UTC")) %>%
    ggplot(aes(x = Date, y = Adjusted)) +
    geom_line(col = col_palette_blue[6]) +
    geom_point(col = col_palette_blue[6]) +
    geom_vline(xintercept = as_datetime("2018-08-07 12:48:00"),
               col        = col_palette_red[7],
               alpha      = 1)

    # annotate(geom  = "text",
    #          label = "High Volatility Period",
    #          x     = as.Date("2020-04-01"),
    #          y     = -3)

p_time_series_Tesla_private_tweet

# p_time_series_Tesla_private_tweet %>%
#     ggplotly()

We next tokenize the tweets into words. This enables us to quantitatively analyse them.

df_tweets_elon_musk_tokens <- df_tweets_elon_musk %>%
    unnest_tokens(output = words,
                  input  = text,
                  token  = "words")
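
A minimal sketch of what `unnest_tokens()` does, using two made-up sentences instead of real tweets. By default the function lowercases the text and strips punctuation, which is why later steps match on lowercase words such as “tesla”.

```r
library(dplyr)
library(tidytext)

# Hypothetical example texts, not actual tweets
df_toy_text <- tibble(id   = 1:2,
                      text = c("Tesla is great!", "Taking Tesla private."))

# One row per word; the text column is replaced by the new column "words"
df_toy_tokens <- df_toy_text %>%
    unnest_tokens(output = words,
                  input  = text,
                  token  = "words")

df_toy_tokens
```

All other columns (here `id`) are repeated for every word, so each token can still be traced back to its source text.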

When we count which words appear most often in the tweets, we see that they are common ones such as “to”, “the”, etc. These are known as stop words and it makes sense to remove them for a meaningful analysis.

df_tweets_elon_musk_tokens %>%
    count(words) %>%
    arrange(desc(n)) %>%
    datatable()

Let’s do just that and voilà: the most frequently used words in the tweets now make much more sense, and we can actually start using them for further analysis.

stop_words_custom <- tribble(~ word,   ~ lexicon,
                             "http",  "CUSTOM",
                             "https", "CUSTOM",
                             "t.co",  "CUSTOM",
                             "amp",   "CUSTOM",
                             "it’s",  "CUSTOM")

stop_words_final <- stop_words %>%
    bind_rows(stop_words_custom)

df_tweets_elon_musk_tokens_cleaned <- df_tweets_elon_musk_tokens %>%
    anti_join(stop_words_final,
              by = c("words" = "word"))

df_tweets_elon_musk_tokens_cleaned %>%
    count(words) %>%
    arrange(desc(n)) %>%
    datatable()
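
A detail that explains the custom entry “it’s”: stop-word matching is an exact string comparison, so a word typed with a typographic apostrophe (’) does not match a lexicon entry written with a straight one ('). The sketch below uses a tiny hypothetical stop-word list rather than the full `stop_words` lexicon:

```r
library(dplyr)

# Hypothetical tokens: "it\u2019s" uses a typographic apostrophe
df_toy_words <- tibble(words = c("it\u2019s", "it's", "tesla", "the"))

# Hypothetical stop-word list that only knows the straight-apostrophe spelling
stop_words_toy <- tibble(word = c("it's", "the"))

# anti_join() removes exact matches only, so the typographic variant survives
df_toy_cleaned <- df_toy_words %>%
    anti_join(stop_words_toy,
              by = c("words" = "word"))

df_toy_cleaned
```

Adding the typographic spelling to the custom list, as done above, closes this gap.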

We visualise the number of times the words occur in Musk’s tweets first with a simple flipped bar plot.

p_bar_flipped_tweet_word_count <- df_tweets_elon_musk_tokens_cleaned %>%
    count(words) %>%
    filter(n >= 70) %>%
    ggplot(aes(x = fct_reorder(words, n), y = n)) +
    geom_col(aes(fill = if_else(str_detect(words, pattern = "tesla"),
                                "red",
                                "blue")),
             alpha = 0.85) +
    geom_text(aes(y     = n + 10,
                  label = n),
              size = 2.5) +
    coord_flip() +
    scale_fill_manual(values = c("red" = col_palette_red[7], "blue" = col_palette_blue[6])) +
    labs(title    = "Tesla Seems Indeed to be Important for Elon Musk… (It's All He Talks about All-Day Long!)",
         subtitle = "Word Counts in Elon Musk's Tweets",
         x        = "Word",
         y        = "Word Counts",
         caption  = "© Data Science & Technology Club HSG") +
    theme(legend.position = "none",
          plot.title      = element_text(face = "bold"),
          axis.line       = element_line(size = 0.75))

p_bar_flipped_tweet_word_count

In addition, we can also add some more labels and descriptions for highlighting purposes.

label_text <- "Words related \n to Tesla \n in Elon Musk's \n tweets"

  1. Source: SEC Statement and Press Release on Tesla